3 UMAP

We’ll use the mathemagical Uniform Manifold Approximation and Projection (UMAP) algorithm to project the already dimension-reduced data (150 singular vectors) into 2-space.

# svd_ump = umap(svd$v)
# save(svd_ump, file='svd_ump.RData')
load('svd_ump.RData')

fig <- plot_ly(type = 'scatter', mode = 'markers')
fig <- fig %>%
  add_trace(
    x = svd_ump$layout[,1],
    y = svd_ump$layout[,2],
    text = ~paste('heading:', head ,"$<br>text: ", raw_text  ),
    hoverinfo = 'text',
    marker = list(color='green'),
    showlegend = F
  )

fig

Outliers causing annoying viz issues requiring the zoom. We will routinely omit these outliers (after noting they make nice clusters of related documents) when creating the plot to avoid having to zoom on the main plot.

index_subset = abs(svd_ump$layout[,1]) <20 & abs(svd_ump$layout[,2]) <20
data_subset = svd_ump$layout[index_subset,]
raw_text_subset = raw_text[index_subset]
head_subset = head[index_subset]

fig <- plot_ly(type = 'scatter', mode = 'markers')
fig <- fig %>%
  add_trace(
    x = data_subset[,1],
    y = data_subset[,2],
    text = ~paste('heading:', head_subset ,"$<br>text: ", raw_text_subset ),
    hoverinfo = 'text',
    marker = list(color='green'),
    showlegend = F
  )

fig

After omitting the outliers, we see a nice plot that looks like it has some nice cluster separation.